String Metrics and Word Similarity applied to Information Retrieval
Abstract
Over the past three decades, Information Retrieval (IR) has been studied extensively. The purpose of information retrieval is to help users locate the information they are looking for. It is currently applied in a variety of domains, from database systems to web search engines. Its main idea is to locate documents that contain the terms users specify in their queries. The thesis presents several string metrics, such as edit distance, Q-gram, cosine similarity and Dice coefficient. All of these string metrics are based on plain lexicographic term matching and can be applied to classical information retrieval models such as the vector space, probabilistic and Boolean models. Experimental results of string distance metrics on real data are provided and analyzed. Word similarity, or semantic similarity, concerns computing the similarity between concepts or senses of words that are not lexicographically similar. Similarity measures based on WordNet can be classified into two categories: one uses solely semantic links, the other combines corpus statistics with taxonomic distance. Five similarity measures belonging to these two categories are selected for an experimental comparison. Hierarchical clustering algorithms, including both single-linkage and complete-linkage clustering, are studied by employing word similarity measures as clustering criteria. Stopping criteria including Calinski & Harabasz, Hartigan and WB-index are used to find the proper hierarchical level in the clustering algorithms. Experiments on both synthetic and real datasets are conducted and the results are analyzed.
Acknowledgements
I would like to thank Dr. Pasi Fränti for the advice, encouragement and support he provided in supervising this thesis effort. I would also like to thank Qinpei Zhao for her critical analyses, technical advice and recommendations. Special thanks go to Dr. Olli Virmajoki for his reviews and recommendations.
Similar resources
An Intelligent System for Exact Word Retrieval in Document Databases
Automatic information retrieval from document image databases is an important and challenging task. The main challenges are font style, size and spacing between characters. To meet these challenges, we propose a new technique for matching exact word strings from document databases. For this approach, we address two issues: word identification and similarity measurement between documents. ...
Review of ranked-based and unranked-based metrics for determining the effectiveness of search engines
Purpose: Traditionally, there have been many metrics for evaluating search engines; nevertheless, various researchers have proposed new metrics in recent years. Awareness of these new metrics is essential for conducting research on search engine evaluation. So, the purpose of this study was to provide an analysis of important and new metrics for evaluating search engines. Methodology: This is ...
A Survey of Text Similarity Approaches
Measuring the similarity between words, sentences, paragraphs and documents is an important component in various tasks such as information retrieval, document clustering, word-sense disambiguation, automatic essay scoring, short answer grading, machine translation and text summarization. This survey discusses the existing works on text similarity through partitioning them into three approaches;...
Textual Entailment as a Directional Relation
This paper presents three methods for solving the problem of textual entailment, obtained from an equal number of text-to-text similarity metrics. The first method starts with the directional measure of text-to-text similarity presented in Corley and Mihalcea (2005), and integrates word sense disambiguation and several heuristics. The second method exploits the relations between the cosine dire...
Idea-deriving Information Retrieval System
This paper presents an information retrieval system that integrates concept-based retrieval, which focuses on the similarity of the meanings of words, with character-string-matching-based retrieval. The system provides a word association function, a concept retrieval function, and a document classification function, which are expected to help the user to reach the target document quickly or...
Publication date: 2012